Anna Urbala PD5

Loading the models

In [1]:
import dalex as dx
import pandas as pd
import numpy as np
import pickle
In [2]:
rf = pickle.load(open("../../../../WB-XAI-Projekt/RF_model", "rb"))
In [3]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split


# Load and prepare the data
full_data = pd.read_csv("hotel_bookings.csv")
full_data["agent"] = full_data["agent"].astype(str)
threshold = 0.005 * len(full_data)
agents_to_change = full_data['agent'].value_counts()[full_data['agent'].value_counts() < threshold].index
full_data.loc[full_data["agent"].isin(agents_to_change), "agent"] = "other"

countries_to_change = full_data['country'].value_counts()[full_data['country'].value_counts() < threshold].index
full_data.loc[full_data["country"].isin(countries_to_change), "country"] = "other"


# Features included in the model
num_features = ["lead_time", "arrival_date_week_number",
                "stays_in_weekend_nights", "stays_in_week_nights", 
                "adults", "previous_cancellations",
                "previous_bookings_not_canceled",
                "required_car_parking_spaces", "total_of_special_requests", 
                "adr", "booking_changes"]

cat_features = ["hotel", "market_segment", "country", 
                "reserved_room_type",
                "customer_type", "agent"]

features = num_features + cat_features

# Split into explanatory variables and the target
X = full_data.drop(["is_canceled"], axis=1)[features]
y = full_data["is_canceled"]

categorical_names = {}
for feature in cat_features:
    # Impute missing categories with a constant, then label-encode the column.
    cat_transformer = SimpleImputer(strategy="constant", fill_value="Unknown")
    X[feature] = cat_transformer.fit_transform(X[[feature]]).ravel()
    le = LabelEncoder()
    # Pass a 1-D column; a 2-D frame triggers a DataConversionWarning.
    X[feature] = le.fit_transform(X[feature])
    categorical_names[feature] = le.classes_

categorical_names
# Preprocessing
num_transformer = SimpleImputer(strategy="constant")

preprocessor = ColumnTransformer(
    transformers=[("num", num_transformer, num_features)],
    remainder="passthrough")

for feature in num_features:
    X[feature] = X[feature].astype(float)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2, random_state=42)

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def train_model_pipe(model):
    model_pipe = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('model', model)])
    model_pipe.fit(X_train, y_train)
    return model_pipe

lr = train_model_pipe(LogisticRegression(random_state=42, n_jobs=-1))
dt = train_model_pipe(DecisionTreeClassifier(random_state=42))
xgb = train_model_pipe(XGBClassifier(random_state=42, n_jobs=-1,
                                     use_label_encoder=False, eval_metric="logloss"))
In [5]:
explainer_rf = dx.Explainer(rf, X_train, y_train, label = "Random Forest")
explainer_lr = dx.Explainer(lr, X_train, y_train, label = "Logistic Regression")
explainer_dt = dx.Explainer(dt, X_train, y_train, label = "Decision Tree")
explainer_xgb = dx.Explainer(xgb, X_train, y_train, label = "XGBoost")
Preparation of a new explainer is initiated

  -> data              : 95512 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 95512 values
  -> model_class       : sklearn.ensemble._forest.RandomForestClassifier (default)
  -> label             : Random Forest
  -> predict function  : <function yhat_proba_default at 0x7fc6846309d8> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0, mean = 0.371, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.943, mean = -0.00218, max = 0.957
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 95512 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 95512 values
  -> model_class       : sklearn.linear_model._logistic.LogisticRegression (default)
  -> label             : Logistic Regression
  -> predict function  : <function yhat_proba_default at 0x7fc6846309d8> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 1.08e-11, mean = 0.371, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.99, mean = -0.00211, max = 1.0
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 95512 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 95512 values
  -> model_class       : sklearn.tree._classes.DecisionTreeClassifier (default)
  -> label             : Decision Tree
  -> predict function  : <function yhat_proba_default at 0x7fc6846309d8> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 0.0, mean = 0.369, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.941, mean = 7.9e-20, max = 0.958
  -> model_info        : package sklearn

A new explainer has been created!
Preparation of a new explainer is initiated

  -> data              : 95512 rows 17 cols
  -> target variable   : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
  -> target variable   : 95512 values
  -> model_class       : xgboost.sklearn.XGBClassifier (default)
  -> label             : XGBoost
  -> predict function  : <function yhat_proba_default at 0x7fc6846309d8> will be used (default)
  -> predict function  : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
  -> predicted values  : min = 1.41e-06, mean = 0.369, max = 1.0
  -> model type        : classification will be used (default)
  -> residual function : difference between y and yhat (default)
  -> residuals         : min = -0.982, mean = -5.38e-05, max = 0.997
  -> model_info        : package sklearn

A new explainer has been created!

For selected variables from the dataset, compute Partial Dependence Profiles (PDP).

In [6]:
pdp_rf = explainer_rf.model_profile()
pdp_lr = explainer_lr.model_profile()
pdp_dt = explainer_dt.model_profile()
pdp_xgb = explainer_xgb.model_profile()
Calculating ceteris paribus: 100%|██████████| 17/17 [00:04<00:00,  3.89it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 24.32it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 40.93it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 17.80it/s]
In [7]:
pdp_rf.plot([pdp_lr, pdp_dt, pdp_xgb])

Overall, the partial dependence profiles are comparable across all variables for all models except logistic regression, which has the worst performance and was already an outlier last time; this is related to the specifics of that model. We can therefore consider our three models consistent. There are occasional differences (e.g., for previous_cancellations, XGBoost's prediction is about 0.2 higher than the rest), but the profiles generally have a similar shape, so these differences mainly affect model accuracy rather than the explanations.
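The averaging behind a PDP can be sketched in a few lines. This is a toy illustration of the definition, not dalex's implementation: the profile value at a grid point is the model's prediction averaged over the data, with the feature of interest fixed at that point.

```python
# Toy PDP: fix the chosen feature at each grid value and average predictions.
def pdp(model, data, feature_idx, grid):
    profile = []
    for v in grid:
        preds = []
        for row in data:
            modified = list(row)
            modified[feature_idx] = v  # fix the feature of interest
            preds.append(model(modified))
        profile.append(sum(preds) / len(preds))  # average over observations
    return profile

# Tiny demo model: the prediction depends on both features.
model = lambda r: 0.5 * r[0] + r[1]
data = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
print(pdp(model, data, 0, [0.0, 2.0]))  # [1.0, 2.0]
```

dalex builds essentially this average on top of the ceteris paribus profiles computed above, one curve per model, which is why differences between models show up directly as vertical offsets between curves.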

For selected variables from the dataset, compute Accumulated Local Effects (ALE).

In [8]:
ale_rf = explainer_rf.model_profile(type = 'accumulated')
ale_lr = explainer_lr.model_profile(type = 'accumulated')
ale_dt = explainer_dt.model_profile(type = 'accumulated')
ale_xgb = explainer_xgb.model_profile(type = 'accumulated')
Calculating ceteris paribus: 100%|██████████| 17/17 [00:03<00:00,  4.27it/s]
Calculating accumulated dependency: 100%|██████████| 17/17 [00:02<00:00,  6.77it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 25.23it/s]
Calculating accumulated dependency: 100%|██████████| 17/17 [00:02<00:00,  6.73it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 49.12it/s]
Calculating accumulated dependency: 100%|██████████| 17/17 [00:02<00:00,  6.00it/s]
Calculating ceteris paribus: 100%|██████████| 17/17 [00:00<00:00, 20.09it/s]
Calculating accumulated dependency: 100%|██████████| 17/17 [00:02<00:00,  5.91it/s]
In [9]:
ale_rf.plot([ale_lr, ale_dt, ale_xgb])

Overall, the shapes look similar to the PDP curves, which supports the hypothesis that our three models are consistent. Still, let us also compare the PDP and ALE of our main model.
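For reference, ALE differs from PDP in that it accumulates *mean local differences* of predictions within bins of the feature instead of averaging whole predictions, which makes it more reliable when features are correlated. A minimal 1-D sketch of that idea (an assumed simplification for illustration, not dalex's `accumulated` code path):

```python
# Toy 1-D ALE: per bin, average the prediction change from the bin's lower
# edge to its upper edge over the observations in the bin, then accumulate.
def ale_1d(model, data, feature_idx, bin_edges):
    effects, accumulated = [], 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        diffs = []
        for row in data:
            if lo <= row[feature_idx] <= hi:
                up, down = list(row), list(row)
                up[feature_idx], down[feature_idx] = hi, lo
                diffs.append(model(up) - model(down))
        if diffs:
            accumulated += sum(diffs) / len(diffs)
        effects.append(accumulated)  # running sum of local effects
    return effects

# For a linear model 2*x the effect grows by 2 per unit-width bin.
print(ale_1d(lambda r: 2 * r[0], [[0.5], [1.5]], 0, [0.0, 1.0, 2.0]))  # [2.0, 4.0]
```

(The sketch skips centering and lets an observation sitting exactly on a bin edge count in both bins; real implementations handle both.)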

In [10]:
ale_rf.result['_label_'] = "ALE"
pdp_rf.result['_label_'] = "PDP"
In [11]:
ale_rf.plot(pdp_rf)

The curves are parallel and lie very close to each other (the largest prediction differences are roughly 0.15 for country and 0.2 for lead_time). Overall there is no cause for concern: our profiles are consistent and should yield valid summaries.
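These differences can also be computed rather than eyeballed. Both aggregated profiles expose a `result` DataFrame with `_vname_`, `_x_` and `_yhat_` columns, so (assuming the two profiles share grid points, which holds here since both are built from the same ceteris paribus grid) a per-variable maximum gap can be sketched as:

```python
import pandas as pd

# Hedged sketch: merge the ALE and PDP `result` frames on variable and grid
# point, then take the maximum absolute prediction difference per variable.
def max_profile_gap(ale_result, pdp_result):
    merged = pd.merge(ale_result, pdp_result,
                      on=["_vname_", "_x_"], suffixes=("_ale", "_pdp"))
    merged["gap"] = (merged["_yhat__ale"] - merged["_yhat__pdp"]).abs()
    return merged.groupby("_vname_")["gap"].max()

# e.g. max_profile_gap(ale_rf.result, pdp_rf.result).sort_values(ascending=False)
```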

Note

With our model, calling pdp.plot(geom="profiles") turned out to be a very bad idea: rendering took about 10 minutes and produced nothing but images with a gray background, so I decided not to include the output.
